Example Single Cell RNA-seq Analysis

Author

Kazi Tanvir Hasan

Published

March 19, 2023

Attaching SeuratObject

Import Seurat Object

Data QC and Inspection

After creating the Seurat object, the next step is to do quality control on the data. The most common quality control is to filter out

  1. Cells with too few genes detected. They usually represent cells which are not sequenced deep enough for reliable characterization.

  2. Cells with too many genes detected. They may represent doublets or multiplets (i.e. two or more cells in the same droplet, therefore sharing the same cell barcode).

  3. Cells with high mitochondrial transcript percentage. As most of the scRNA-seq experiments use oligo-T to capture mRNAs, mitochondrial transcripts should be relatively under-representative due to their lack of poly-A tails, but it is unavoidable that some mitochondrial transcripts are captured. Meanwhile, there is also some evidence that stable poly-A tails exist in some mitochondrial transcripts but serve as a marker for degradation (e.g. this paper). Together, cells with high mitochondrial transcript percentage likely represent cells under stress (e.g. hypoxia) which produce a lot of mitochondria, or which produce an abnormally high amount of truncated mitochondrial transcripts.

While numbers of detected genes are summarized by Seurat automatically when creating the Seurat object (with nFeature_RNA being the number of detected genes/features; nCount_RNA being the number of detected transcripts), one needs to calculate mitochondial transcript percentages manually. Still, Seurat provides an easy solution

Please note that there is no one-size-fits-all filtering criteria, as the normal ranges of these metrics can vary dramatically from one experiment to another, depending on sample origin as well as reagents and sequencing depths. One suggestion here is to ONLY FILTER OUT OUTLIER CELLS, i.e. the minority of cells with certain QC metrics clearly above or below the majority of cells. To do that, one needs to first know how these values are distributed in the data. One can look at the distribution by creating a violin plot for each of the metrics.

Or if you don’t like the dots (individual cells)

And as one would expect, number of detected genes and number of detected transcripts are well correlated across cells while mitochondrial transcript percentage is not.

P.S. patchwork is an R package developed to facilitate layout of plots produced by ggplot2 (Seurat uses ggplot2 to produce plots if you use the plotting functions in the Seurat package). Without patchwork, it is illegal to run plot1 + plot2.

Due to the correlation of gene number and transcript number, we only need to set a cutoff to either one of these metrics, combined with an upper threshold of mitochondrial transcript percentage, for the QC. For instance, for this data set, a detected gene number between 500 and 5000, and a mitochondrial transcript percentage lower than 5% would be quite reasonable, but it is fine to use different thresholds.

It is worth to mention that sometimes more QC may need to be applied. One potential issue is the presence of doublets. As the amount of captured RNA varies a lot from cell to cell, doublets don’t always show a higher number of detected genes or transcripts. There are several tools available now, which are designed to predict whether a ‘cell’ is indeed a singlet or actually a doublet/multiplet. DoubletFinder, for instance, predicts doublets by first constructing artificial doublets by randomly averaging cells in the data, and then for each cell testing whether it is more similar to the artificially doublets or not. This helps with the decision whether a cell is likely a doublet or not. Similarly, mitochondrial transcript percentage may not be sufficient to filter out stressed or unhealthy cells. Sometimes one would needs to do extra filtering, e.g. based on the machine learning based prediction.

Normalization and Data scaling

Normalization

Similar to bulk RNA-seq, the amount of captured RNA is different from cell to cell, and one should therefore not directly compare the number of captured transcripts for each gene between cells. A normalization step, aiming to make gene expression levels between different cells comparable, is therefore necessary. The most commonly used normalization in scRNA-seq data analysis is very similar to the concept of TPM (Transcripts Per Million reads) - one normalizes the feature expression measurements for each cell to the total expression, and then multiplies this by a scale factor (10000 by default). At the end, the resulting expression levels are log-transformed so that the expression values better fit a normal distribution. It is worth to mention that before doing the log-transformation, one pseudocount is added to every value so that genes with zero transcripts detected in a cell still present values of zero after log-transform.

Feature selection

The biggest advantage of single-cell as compared to bulk RNA-seq is the potential to look into cellular heterogeneity of samples, by looking for cell groups with distinct molecular signatures. However, not every gene has the same level of information and the same contribution when trying to identify different cell groups. For instance, genes with low expression levels, and those with similar expression levels across all cells, are not very informative and may dilute differences between distinct cell groups. Therefore, it is necessary to perform a proper feature selection before further exploring the scRNA-seq data.

In Seurat, or more general in scRNA-seq data analysis, this step usually refers to the identification of highly variable features/genes, which are genes with the most varied expression levels across cells.

VST function calculates a variance stabilizing transformation (VST) from the fitted dispersion-mean relation(s) and then transforms the count data (normalized by division by the size factors or normalization factors), yielding a matrix of values which are now approximately homoskedastic (having constant variance along the range of mean values). The transformation also normalizes with respect to library size. The rlog is less sensitive to size factors, which can be an issue when size factors vary widely. These transformations are useful when checking for outliers or as input for machine learning techniques such as clustering or linear discriminant analysis.

By default, Seurat calculates the standardized variance of each gene across cells, and picks the top 2000 ones as the highly variable features. One can change the number of highly variable features easily by giving the nfeatures option (here the top 3000 genes are used).

There is no good criteria to determine how many highly variable features to use. Sometimes one needs to go through some iterations to pick the number that gives the most clear and interpretable result. Most often, a value between 2000 to 5000 is OK and using a different value doesn’t affect the results too much.

 [1] "CTGF"         "CTC-340A15.2" "NPY"          "KIF26B"       "SRGN"        
 [6] "CCL3"         "CNR1"         "RELN"         "AJ006998.2"   "AC011288.2"  
When using repel, set xnudge and ynudge to 0 for optimal results
Warning: Transformation introduced infinite values in continuous x-axis

Warning: Transformation introduced infinite values in continuous x-axis

Data Scaling

Since different genes have different base expression levels and distributions, the contribution of each gene to the analysis is different if no data transformation is performed. This is something we do not want as we don’t want our analysis to only depend on genes that are highly expressed. Therefore a scaling is applied to the data using the selected features, just like one usually does in any data science field.

Centering and scaling data matrix

Data Clustering (PCA/UMAP)

In principle one can start to look at cell heterogeneity after identifying highly variable genes and scaling the data. However, applying a linear dimension reduction before doing any further analysis is strongly recommended and sometimes even seen as compulsory. The benefit of doing such a dimension reduction includes but is not limited to:

  1. The data becomes much more compact so that computation becomes much faster.

  2. As scRNA-seq data is intrinsically sparse, summarizing measurements of related features greatly enhances the signal robustness.

PC_ 1 
Positive:  SYT1, RASGEF1B, GRIN1, RBFOX1, MT-CO1, MEG3, MT-ND1, MT-CO3, SLC26A3, DLGAP1 
       SNAP25, STXBP5L, CACNA1B, BCYRN1, SYN2, INO80D, PHYHIP, MT-CO2, CSMD1, DNM1 
       LINC01609, ENC1, MT-ATP6, SLC17A7, KCNQ5, WBSCR17, IQCJ-SCHIP1, CHD5, MT-ND2, RYR2 
Negative:  PLP1, SLC1A3, ST18, APOE, CST3, ERBB4, NEAT1, MIR219A2, MT2A, APOD 
       SLC1A2, PREX2, RNF219-AS1, B2M, ADGRV1, LINC01608, MOBP, SPP1, ATP1A2, LSAMP-AS1 
       CNDP1, FGFR3, AQP4, NHSL1, PTPRZ1, LINC00499, EMX2, COL5A3, ETNPPL, AC002429.5 
PC_ 2 
Positive:  PTPRD, ST18, APOD, PLP1, LHFPL3, CNTNAP2, MIR219A2, LRRTM4, NXPH1, OPCML 
       PCDH15, MOBP, LINC01608, GRIK1, SGCZ, LUZP2, RP4-668E10.4, MMP16, ADARB2, SEMA5A 
       GRIA4, PDE1A, CA10, LINC01170, SMOC1, CNDP1, GRIK2, SPP1, KAZN, TNR 
Negative:  ADGRV1, RNF219-AS1, GPC5, SLC1A2, TPD52L1, LINC00499, AQP4, SLC4A4, BMPR1B, EMX2 
       COL5A3, SLC1A3, TRPM3, NKAIN3, GLIS3, ETNPPL, PAMR1, FGFR3, NHSL1, MT2A 
       STON2, APOE, CLU, AC002429.5, RANBP3L, GNA14, MGST1, GJA1, ATP1A2, SLCO1C1 
PC_ 3 
Positive:  PTPRZ1, NXPH1, PCDH15, SOX6, LUZP2, GRIK1, LHFPL3, VCAN, SGCZ, MMP16 
       RP4-668E10.4, BRINP3, ERBB4, GRIK2, OPCML, TNR, LINC01322, KCNMB2-AS1, NKAIN3, SEMA5A 
       KAZN, XKR4, KIAA1211, GPC6, SMOC1, CA10, COL11A1, CSMD1, ADARB2, MIR3681HG 
Negative:  ST18, PLP1, MOBP, NKAIN2, LINC01608, SPP1, MIR219A2, LPAR6, ADAM28, PDE1A 
       CNDP1, DOCK8, FYB, APBB1IP, NPTX2, CD74, BDNF, MAML3, SRGN, NPAS4 
       PTPRC, EGR4, INPP5D, SYK, BLNK, HLA-DRA, OLR1, CX3CR1, ATP8B4, SAMSN1 
PC_ 4 
Positive:  GAD2, ARL4C, GAD1, LINC00486, IGF1, ADARB2, GRIK1, FOS, RP11-123O10.4, DOCK8 
       ERBB4, DLX6-AS1, BTBD11, KCNIP1, MTRNR2L8, ADAM28, ZNF385D, GRIP1, DLX1, SDK1 
       SCG2, EGR1, LPAR6, APBB1IP, CD74, SOX1, FOSB, FYB, RP11-154D17.1, LINC01609 
Negative:  AC011288.2, LDB2, PTPRD, RALYL, ST6GALNAC5, NKAIN2, IQCJ-SCHIP1, RP11-380P13.1, LY86-AS1, FAM19A1 
       SV2B, AC114765.1, PDE1A, LINC01250, MLIP, KCNIP4, ANO3, CBLN2, CADPS2, OLFM3 
       FSTL4, SLIT3, CHN1, MKL2, RP11-191L9.4, AC067956.1, AJ006998.2, MIR137HG, HECW1, PHACTR1 
PC_ 5 
Positive:  DOCK8, ADAM28, ST6GAL1, LPAR6, FYB, APBB1IP, CD74, SRGN, SYK, SRGAP2B 
       C10orf11, INPP5D, PTPRC, RP11-624C23.1, SRGAP2, CSF2RA, HLA-DRA, HS3ST4, P2RY12, BLNK 
       OLR1, ATP8B4, ARHGAP15, SAMSN1, A2M, CX3CR1, TBXAS1, CSF3R, CD86, RP11-480C22.1 
Negative:  PLP1, ST18, MIR219A2, MOBP, ARC, LINC01608, FOSB, NPAS4, EGR4, MTRNR2L8 
       CNDP1, EGR1, ERBB4, JUND, NPTX2, EGR3, PIM3, KIRREL3, LINC01170, DUSP5 
       INHBA, NKAIN2, RP11-154D17.1, APOD, RP11-81H3.2, SLC27A6, MIDN, KDM6B, BCOR, LBH 
PC_ 1 
Positive:  SYT1, RASGEF1B, GRIN1, RBFOX1, MT-CO1 
Negative:  PLP1, SLC1A3, ST18, APOE, CST3 
PC_ 2 
Positive:  PTPRD, ST18, APOD, PLP1, LHFPL3 
Negative:  ADGRV1, RNF219-AS1, GPC5, SLC1A2, TPD52L1 

https://www.rdocumentation.org/packages/jackstraw/versions/1.3/topics/jackstraw

Test for association between the observed data and their systematic patterns of variations. Systematic patterns may be captured by latent variables using principal component analysis (PCA), factor analysis (FA), and related methods. The jackstraw enables statistical testing for association between observed variables and latent variables, as captured by PCs or other estimates.

Warning: Removed 21000 rows containing missing values (`geom_point()`).

We chose 10 here, but encourage users to repeat downstream analyses with a different number of PCs (10, 15, or even 50!). As you will observe, the results often do not differ dramatically.

We advise users to err on the higher side when choosing this parameter. For example, performing downstream analyses with only 5 PCs does significantly and adversely affect results.

Computing nearest neighbor graph
Computing SNN
Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck

Number of nodes: 8970
Number of edges: 296870

Running Louvain algorithm...
Maximum modularity in 10 random starts: 0.9382
Number of communities: 15
Elapsed time: 0 seconds
AAACCTGAGTCACGCC_Ct5_Ct6 AAACCTGCATGGATGG_Ct5_Ct6 AAACCTGGTAGCTAAA_Ct5_Ct6 
                       1                        0                        0 
AAACCTGGTGCTAGCC_Ct5_Ct6 AAACCTGGTGTCGCTG_Ct5_Ct6 
                      13                        1 
Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14

A linear dimension reduction has both pros and cons. The good side is that every PC is a linear combination of gene expression so interpretation of PCs are straightforward. Also the data is compressed but not disorted, therefore information in the data is largely remained. The bad side, on the other hand, is that one usually needs more than 10 PCs to cover most of the information. This is fine for most of the analysis, but not for visualization where it is impossible to go over three dimensions for ordinary persons.

The most commonly used non-linear dimension reduction methods in scRNA-seq data analysis are t-distributed Stochastic Neighbor Embedding (t-SNE) and Uniform Manifold Approximation and Projection (UMAP). In this example analysis I only use umap.

Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
This message will be shown once per session
15:20:20 UMAP embedding parameters a = 0.9922 b = 1.112
15:20:20 Read 8970 rows and found 10 numeric columns
15:20:20 Using Annoy for neighbor search, n_neighbors = 30
15:20:20 Building Annoy index with metric = cosine, n_trees = 50
0%   10   20   30   40   50   60   70   80   90   100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
15:20:21 Writing NN index file to temp file /var/folders/xn/zm5z4n5j6t5fbn1m6pzwtqsc0000gn/T//RtmpZPvAHR/filec3e5bfd4256
15:20:21 Searching Annoy index using 1 thread, search_k = 3000
15:20:23 Annoy recall = 100%
15:20:23 Commencing smooth kNN distance calibration using 1 thread with target n_neighbors = 30
15:20:23 Initializing from normalized Laplacian + noise (using irlba)
15:20:24 Commencing optimization for 500 epochs, with 369150 positive edges
15:20:37 Optimization finished

Markers Identification

For a more efficient implementation of the Wilcoxon Rank Sum Test,
(default method for FindMarkers) please install the limma package
--------------------------------------------
install.packages('BiocManager')
BiocManager::install('limma')
--------------------------------------------
After installation of limma, Seurat will automatically use the more 
efficient implementation (no further action necessary).
This message will be shown once per session
        p_val avg_log2FC pct.1 pct.2 p_val_adj
PLPP3       0       1.65 0.512 0.098         0
NFIA        0       1.50 0.874 0.360         0
CACHD1      0       1.92 0.556 0.106         0
C1orf61     0       2.00 0.805 0.227         0
ATP1A2      0       2.27 0.686 0.086         0

        p_val avg_log2FC pct.1 pct.2 p_val_adj
STXBP5L     0       1.96 0.886 0.371         0
KCNIP4      0       1.87 0.969 0.511         0
KCNQ5       0       1.99 0.761 0.263         0
SYT1        0       1.79 0.957 0.398         0
MEG3        0       1.67 0.992 0.605         0

         p_val avg_log2FC pct.1 pct.2 p_val_adj
CHRM3        0       3.02 0.765 0.051         0
INO80D       0       3.83 0.663 0.058         0
SYNPR        0       3.60 0.765 0.028         0
ROBO2        0       4.23 0.925 0.093         0
RASGEF1B     0       5.71 0.683 0.036         0

Calculating cluster 0
Calculating cluster 1
Calculating cluster 2
Calculating cluster 3
Calculating cluster 4
Calculating cluster 5
Calculating cluster 6
Calculating cluster 7
Calculating cluster 8
Calculating cluster 9
Calculating cluster 10
Calculating cluster 11
Calculating cluster 12
Calculating cluster 13
Calculating cluster 14

Warning in FetchData.Seurat(object = object, vars = c(dims, "ident", features),
: The following requested variables were not found: MS4A1, FCER1A, PPBP
Warning: CombinePlots is being deprecated. Plots should now be combined using
the patchwork system.

# A tibble: 150 × 7
# Groups:   cluster [15]
   p_val avg_log2FC pct.1 pct.2 p_val_adj cluster gene     
   <dbl>      <dbl> <dbl> <dbl>     <dbl> <fct>   <chr>    
 1     0       4.31 0.991 0.219         0 0       PLP1     
 2     0       3.98 0.911 0.107         0 0       ST18     
 3     0       3.72 0.963 0.285         0 0       CTNNA3   
 4     0       3.50 0.962 0.271         0 0       MBP      
 5     0       3.49 0.853 0.108         0 0       TMEM144  
 6     0       3.25 0.609 0.045         0 0       LINC01608
 7     0       3.19 0.748 0.113         0 0       MIR219A2 
 8     0       3.13 0.633 0.059         0 0       MOBP     
 9     0       3.09 0.681 0.092         0 0       TF       
10     0       3.08 0.696 0.1           0 0       CLDN11   
# … with 140 more rows
Warning: CombinePlots is being deprecated. Plots should now be combined using
the patchwork system.

R version 4.2.1 (2022-06-23)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur ... 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] lubridate_1.9.2    forcats_1.0.0      stringr_1.5.0      dplyr_1.1.0       
 [5] purrr_1.0.1        readr_2.1.4        tidyr_1.3.0        tibble_3.2.0      
 [9] ggplot2_3.4.1      tidyverse_2.0.0    patchwork_1.1.2    SeuratObject_4.1.3
[13] Seurat_4.3.0       conflicted_1.2.0  

loaded via a namespace (and not attached):
  [1] Rtsne_0.16             colorspace_2.1-0       deldir_1.0-6          
  [4] ellipsis_0.3.2         ggridges_0.5.4         rstudioapi_0.14       
  [7] spatstat.data_3.0-1    farver_2.1.1           leiden_0.4.3          
 [10] listenv_0.9.0          ggrepel_0.9.3          fansi_1.0.4           
 [13] codetools_0.2-19       splines_4.2.1          cachem_1.0.7          
 [16] knitr_1.42             polyclip_1.10-4        jsonlite_1.8.4        
 [19] ica_1.0-3              cluster_2.1.4          png_0.1-8             
 [22] uwot_0.1.14            shiny_1.7.4            sctransform_0.3.5     
 [25] spatstat.sparse_3.0-1  compiler_4.2.1         httr_1.4.5            
 [28] Matrix_1.5-3           fastmap_1.1.1          lazyeval_0.2.2        
 [31] cli_3.6.0              later_1.3.0            htmltools_0.5.4       
 [34] tools_4.2.1            igraph_1.4.1           gtable_0.3.2          
 [37] glue_1.6.2             RANN_2.6.1             reshape2_1.4.4        
 [40] Rcpp_1.0.10            scattermore_0.8        vctrs_0.5.2           
 [43] nlme_3.1-162           spatstat.explore_3.1-0 progressr_0.13.0      
 [46] lmtest_0.9-40          spatstat.random_3.1-4  xfun_0.37             
 [49] globals_0.16.2         timechange_0.2.0       mime_0.12             
 [52] miniUI_0.1.1.1         lifecycle_1.0.3        irlba_2.3.5.1         
 [55] goftest_1.2-3          future_1.32.0          MASS_7.3-58.3         
 [58] zoo_1.8-11             scales_1.2.1           hms_1.1.2             
 [61] promises_1.2.0.1       spatstat.utils_3.0-2   parallel_4.2.1        
 [64] RColorBrewer_1.1-3     yaml_2.3.7             memoise_2.0.1         
 [67] reticulate_1.28        pbapply_1.7-0          gridExtra_2.3         
 [70] stringi_1.7.12         rlang_1.1.0            pkgconfig_2.0.3       
 [73] matrixStats_0.63.0     evaluate_0.20          lattice_0.20-45       
 [76] ROCR_1.0-11            tensor_1.5             labeling_0.4.2        
 [79] htmlwidgets_1.6.2      cowplot_1.1.1          tidyselect_1.2.0      
 [82] parallelly_1.34.0      RcppAnnoy_0.0.20       plyr_1.8.8            
 [85] magrittr_2.0.3         R6_2.5.1               generics_0.1.3        
 [88] DBI_1.1.3              withr_2.5.0            pillar_1.8.1          
 [91] fitdistrplus_1.1-8     survival_3.5-5         abind_1.4-5           
 [94] sp_1.6-0               future.apply_1.10.0    KernSmooth_2.23-20    
 [97] utf8_1.2.3             spatstat.geom_3.1-0    plotly_4.10.1         
[100] tzdb_0.3.0             rmarkdown_2.20         grid_4.2.1            
[103] data.table_1.14.8      digest_0.6.31          xtable_1.8-4          
[106] httpuv_1.6.9           munsell_0.5.0          viridisLite_0.4.1